This replication package contains the code and dataset for:

Kaylea Champion and Benjamin Mako Hill. 2023. Taboo and Collaborative Knowledge Production: Evidence from Wikipedia. Proc. ACM Hum.-Comput. Interact. 7, CSCW2, Article 299 (October 2023), 25 pages. https://doi.org/10.1145/3610090

The replication package is distributed at:

Champion, Kaylea, 2023, "Replication Data for: Taboo and Collaborative Knowledge Production: Evidence from Wikipedia", https://doi.org/10.7910/DVN/5OKEEO, Harvard Dataverse


----------------------------------------------------------------------------------------
|                                                                
| NOTE: This is a long pipeline requiring hundreds of hours of CPU time!                                  
|
| If you don't want to use HPC resources but would like to jump straight to the final 
| dataset and R code, skip to step 8.
|
----------------------------------------------------------------------------------------


This replication package attempts to balance transparency with practical concerns. The dataset was derived from very large data sources and developed using a range of high-performance computing resources at the University of Washington and Northwestern University, so a full replication is likely to require hundreds of hours of compute time on multiprocessor clusters. If you'd like to explore the more complete pipeline and build process, datasets from intermediate stages are included where marked with ==>. Please contact the author if you have any questions or need materials from intermediate stages that are not provided.

Prerequisites: 
		- Apache Spark
		- TSV of all Wikipedia revisions, parsed from the XML data at: 
			https://dumps.wikimedia.org/enwiki/, using the wikiq wiki-parsing software
		- TSV of all current article titles (also from dumps.wikimedia.org)
		- views of all articles by month, corrected for redirects using the procedure described in 
			"Benjamin Mako Hill and Aaron Shaw. 2014. Consider the Redirect: A Missing Dimension 
			of Wikipedia Research. In Proceedings of The International Symposium on Open 
			Collaboration (OpenSym '14). Association for Computing Machinery, New York, NY, USA, 
			1–4. https://doi.org/10.1145/2641580.2641616", indexed by encoded title
		- quality of all articles by month, indexed by encoded title
		- page protection spells of all articles as calculated using the procedure described in
			"Benjamin Mako Hill and Aaron Shaw. 2015. Page protection: another 
			missing dimension of Wikipedia research. In Proceedings of the 11th 
			International Symposium on Open Collaboration (OpenSym '15). 
			Association for Computing Machinery, New York, NY, USA, Article 15, 
			1–4. https://doi.org/10.1145/2788993.2789846"


Step 0: Parse Wikipedia article and view data, calculate article quality
	Code:
		- step0/wikiq_postproc_nthuser.py -- wikiq post-processing script 
	Outputs:
		- post-processed TSV of all Wikipedia revisions containing user experience levels (nth edit)
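
The "nth edit" idea in the output above can be illustrated with a minimal sketch. This is not the code in step0/wikiq_postproc_nthuser.py; the column names (revid, editor, timestamp) and the tiny inline TSV are assumptions made up for illustration, and the real pipeline lists Apache Spark as a prerequisite rather than plain Python.

```python
import csv
import io
from collections import defaultdict


def add_nth_edit(rows):
    """Yield each revision row with an added 'nth_edit' counter per editor.

    Assumes rows arrive sorted by timestamp. The 'editor' column name is
    hypothetical, not taken from the wikiq output format.
    """
    counts = defaultdict(int)
    for row in rows:
        counts[row["editor"]] += 1
        yield {**row, "nth_edit": counts[row["editor"]]}


# Made-up input standing in for the parsed revision TSV.
sample_tsv = (
    "revid\teditor\ttimestamp\n"
    "1\tAlice\t2020-01-01\n"
    "2\tBob\t2020-01-02\n"
    "3\tAlice\t2020-01-03\n"
)
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
revisions = list(add_nth_edit(reader))
```

Each row keeps its original fields and gains a running count of how many edits that editor had made up to and including that revision.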

Step 1: Obtain and Parse Wiktionary
	Inputs and Prerequisites:
		- Download pre-parsed Wiktionary data from https://kaikki.org/dictionary/ -- 
			a .json file of pre-parsed data (1.1 G) is available from: 
			https://kaikki.org/dictionary/English/index.html
	Code:
		- step1/cleanupWikt.py
		- step1/parseWikt.py
	Outputs a parsed version of Wiktionary with taboo marked
		==> processed_data/narrow_cleanedParsedWikts.tsv
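
As a rough illustration of this step, the sketch below reads entries in the one-JSON-object-per-line layout used by the kaikki.org downloads and flags senses by their tags. It is not the code in step1/cleanupWikt.py or step1/parseWikt.py: the set of tags treated as taboo, the output field names, and the sample entries are all assumptions made for this example.

```python
import json

# Assumption: which Wiktionary sense tags count as "taboo" is a choice made
# here for illustration only; the package's scripts define the real rule.
TABOO_TAGS = {"vulgar", "offensive", "derogatory"}


def parse_entries(lines):
    """Yield one flat record per sense from JSON-lines Wiktionary data."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for sense in entry.get("senses", []):
            tags = set(sense.get("tags", []))
            yield {
                "word": entry.get("word", ""),
                "gloss": "; ".join(sense.get("glosses", [])),
                "taboo": bool(tags & TABOO_TAGS),
            }


# Made-up entries standing in for lines of the downloaded .json file.
sample_lines = [
    json.dumps({"word": "exampleword",
                "senses": [{"glosses": ["A neutral thing."], "tags": []}]}),
    json.dumps({"word": "otherword",
                "senses": [{"glosses": ["A rude thing."], "tags": ["vulgar"]}]}),
]
parsed = list(parse_entries(sample_lines))
```

Reading line by line keeps memory use low, which matters for a download of roughly 1.1 G.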

Step 2: Classify Wiktionary entries and identify set of 'dictionary salient' words
	Code:
		- step2/ridge_skl_article.py -- runs the classifier
		- step2/salienceCutDown.py -- identifies dictionary salient words
	Outputs dictionary-salient titles

Step 3: Draw samples from Wikipedia
	Code:
		- step3/bySalienceSampler.py -- draws the dictionary-salient sample
		- step3/euphSampler.py -- draws the euphemistic (i.e., taboo) sample
	Outputs samples for analysis.
		==> processed_data/euphSample.tsv
		==> processed_data/tabooSample.tsv
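
For readers who want to draw their own sample, one simple approach is seeded random sampling without replacement, sketched below. This is an assumption for illustration: the actual selection logic lives in step3/bySalienceSampler.py and step3/euphSampler.py and may differ, and the function name, seed, and titles here are hypothetical.

```python
import random


def draw_sample(titles, k, seed=0):
    """Draw up to k distinct titles, reproducibly for a given seed.

    Sorting the de-duplicated pool first makes the result independent of
    the input order, so the same seed always yields the same sample.
    """
    rng = random.Random(seed)
    pool = sorted(set(titles))
    return rng.sample(pool, min(k, len(pool)))


# Made-up titles standing in for the candidate article list.
candidate_titles = ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Beta"]
sample_a = draw_sample(candidate_titles, 3, seed=42)
sample_b = draw_sample(candidate_titles, 3, seed=42)
```

Fixing the seed means a rerun produces the same list, which makes downstream steps easier to compare across runs.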

Step 4: Extract revision-level data from Wikipedia for the sample
	Code: 
		- step4/lookupAllRevdata.py -- runs against each sample
	Outputs folders of semi-raw data, culled from parsed population-level data from step 0: 
		revision data, view data, and quality data
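
Culling sample rows from the population-level data amounts to a filter on article title. The sketch below shows that idea in plain Python; it is not step4/lookupAllRevdata.py, and the column name "title" and the inline data are assumptions made for illustration.

```python
import csv
import io


def filter_revisions(revision_rows, sample_titles, title_column="title"):
    """Yield only the rows whose title is in the sample.

    The title column name is hypothetical. A set gives constant-time
    membership checks, which matters when streaming a very large TSV.
    """
    wanted = set(sample_titles)
    for row in revision_rows:
        if row[title_column] in wanted:
            yield row


# Made-up data standing in for the step 0 output and a step 3 sample.
population_tsv = (
    "revid\ttitle\n"
    "10\tAlpha\n"
    "11\tBeta\n"
    "12\tAlpha\n"
    "13\tGamma\n"
)
rows = csv.DictReader(io.StringIO(population_tsv), delimiter="\t")
sample_revisions = list(filter_revisions(rows, ["Alpha", "Gamma"]))
```

The same filter could be applied to the view and quality data, provided they share an index with the sample.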

Step 5: Extract user data from Wikipedia
	Code:
		- step5/wikiq_postproc_userpages.py -- extracts length of editor's user page at time of edit for all revisions
		- step5/getJustUsernames.py -- extracts usernames from the sample
		- step5/getUserpageLengthAtEdit.py -- extracts userpage length for just the sample
		- step5/checkMailability.py -- calls the API
	Outputs user data -- userpage length, mailability, gender.
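
The mailability check can be sketched against the public MediaWiki Action API, which reports whether a user can receive email via list=users with usprop=emailable. This is an illustrative sketch and not step5/checkMailability.py: the helper names, the User-Agent string, and the canned response are assumptions, and the network call is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_emailable_url(usernames):
    """Build a query URL asking whether each user can receive email."""
    params = {
        "action": "query",
        "list": "users",
        "ususers": "|".join(usernames),
        "usprop": "emailable",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_emailable(response):
    """Map each returned username to True/False.

    In the default JSON format the 'emailable' key is present (as an empty
    string) only when the user accepts email, so absence is read as False.
    """
    result = {}
    for user in response.get("query", {}).get("users", []):
        flag = user.get("emailable", False)
        result[user.get("name", "")] = flag is not False
    return result


def check_mailability(usernames):
    """Call the live API. The User-Agent value is a placeholder to replace."""
    request = Request(build_emailable_url(usernames),
                      headers={"User-Agent": "replication-example/0.1"})
    with urlopen(request) as reply:
        return parse_emailable(json.load(reply))


# Canned response with made-up users, so the parser runs without a network.
canned = {"query": {"users": [
    {"userid": 1, "name": "ExampleUserA", "emailable": ""},
    {"userid": 2, "name": "ExampleUserB"},
]}}
mailable = parse_emailable(canned)
```

Results from a live call reflect account settings at the time of the request, so a rerun today may not match the values recorded for the paper.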
		

Step 6: Extract article data from Wikipedia; identify categories and protection levels
	Code:
		- step6/getCats.py -- makes API calls to get article categories; run once per sample
	Outputs article category data; the protection spells data from the prerequisites is also needed.
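
Fetching categories through the MediaWiki Action API (prop=categories) usually means following the API's continuation tokens, since one article's category list can span several responses. The sketch below shows that loop; it is not step6/getCats.py. The function names are hypothetical, and the `fetch` argument is a stand-in for whatever code performs the HTTP request, replaced here by a fake that returns made-up data.

```python
def get_categories(title, fetch):
    """Collect all category titles for one article, following continuation.

    `fetch` is any callable that takes a dict of API parameters and returns
    the decoded JSON response (an assumption of this sketch).
    """
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
        "format": "json",
    }
    categories = []
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            categories.extend(c["title"] for c in page.get("categories", []))
        if "continue" not in data:
            return categories
        # Merge the continuation tokens into the next request.
        params = {**params, **data["continue"]}


def fake_fetch(params):
    """Stand-in for a real HTTP call: returns two made-up result batches."""
    if "clcontinue" not in params:
        return {
            "continue": {"clcontinue": "1|B", "continue": "||"},
            "query": {"pages": {"1": {"title": "Example", "categories": [
                {"ns": 14, "title": "Category:A"}]}}},
        }
    return {"query": {"pages": {"1": {"title": "Example", "categories": [
        {"ns": 14, "title": "Category:B"}]}}}}


example_categories = get_categories("Example", fake_fetch)
```

Passing the request function in as an argument keeps the loop testable offline and lets rate limiting or retries live in one place.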
		

Step 7: Glue together data sources
	Inputs and Prerequisites:
		- uses helper code from libs/lib-00-utils.R
	Code:
		- step7/prepDF1.R -- also calls prepCategoryData.R, justDamaging.R, and prepUserData.R
		- step7/prepDF2.R
		- step7/anonymize.R -- shows how I stripped out identifiers
	Outputs:
		==> processed_data/dataset1.RData
		==> processed_data/dataset2.RData

Step 8: Run statistical analysis and build figures
	Inputs and Prerequisites:
		- uses helper code from libs/lib-00-utils.R
	Code:
		- step8/standalone.R -- this is the core analytical file
		- step8/vizBuild.R
		- step8/viewdataViz.R
		- step8/onlineSupplement.Rmd
	Outputs figures and RData files used to build the paper in Overleaf
		==> knitr_data/knitr_data.RData


